Data 624 PROJECT 2 (Week 12) Forecasting EDA

Joint Work

Alexander Ng, Philip Tanofsky

Due 12/13/2021

Problem Statement

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.

Data Wrangling Strategy

We will describe the data wrangling steps in brief before doing any exploratory data analysis and model building. After loading the raw data files, our transformations and changes are as follows:

  1. Add a primary key id to allow unique identification of all raw observations.
  2. Drop observations with NA in the response variable PH.
  3. Drop observations with a high number of missing values. In our case, observations that have more than 4 missing values.
  4. Rename the variables to remove spaces.
  5. Split the initial data set, using stratified sampling on PH, into an 80/20 train and validation split. We recognize that a separate test data set exists, but that is reserved for external assessment of the project's prediction accuracy.
  6. Identify and drop near zero variance predictors in the training set, and apply the same predictor removals to the validation data set (regardless of their values in the latter).
  7. Identify and fix zeros and outliers in the predictors in the training set.
  8. Handle special cases of variables.
  9. Apply the KNN imputation strategy to Box-Cox transformed, centered, and scaled observations for all training set points.

Along the way, we will assess the numerical impact of such changes to the dataset.
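Steps 2 and 3 above can be sketched in base R (a minimal illustration only; the project's actual dplyr pipeline appears in the code appendix, and the function name `drop_sparse_rows` is ours):

```r
# Minimal sketch of steps 2-3: drop rows whose response (PH) is NA, then
# drop rows with more than `max_na` missing values overall.
drop_sparse_rows <- function(df, max_na = 4) {
  df <- df[!is.na(df$PH), ]            # step 2: drop observations missing PH
  na_counts <- rowSums(is.na(df))      # count NAs per remaining observation
  df[na_counts <= max_na, ]            # step 3: drop heavily incomplete rows
}
```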

## [1] 2571   33

Let’s evaluate the data characteristics of the raw file after loading. We use the skimr library to get statistics quickly for the entire training data set. The table below shows the numeric variables sorted by their completion rate in descending order. So the most problematic predictors are at the top of the table.

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
MFR 212 0.918 704.049 73.898 31.400 706.300 724.000 731.000 868.600 ▁▁▁▂▇
Filler Speed 57 0.978 3687.199 770.820 998.000 3888.000 3982.000 3998.000 4030.000 ▁▁▁▁▇
PC Volume 39 0.985 0.277 0.061 0.079 0.239 0.271 0.312 0.478 ▁▃▇▂▁
PSC CO2 39 0.985 0.056 0.043 0.000 0.020 0.040 0.080 0.240 ▇▅▂▁▁
Fill Ounces 38 0.985 23.975 0.088 23.633 23.920 23.973 24.027 24.320 ▁▂▇▂▁
PSC 33 0.987 0.085 0.049 0.002 0.048 0.076 0.112 0.270 ▆▇▃▁▁
Carb Pressure1 32 0.988 122.586 4.743 105.600 119.000 123.200 125.400 140.200 ▁▃▇▂▁
Hyd Pressure4 30 0.988 96.289 13.123 52.000 86.000 96.000 102.000 142.000 ▁▃▇▂▁
Carb Pressure 27 0.989 68.190 3.538 57.000 65.600 68.200 70.600 79.400 ▁▅▇▃▁
Carb Temp 26 0.990 141.095 4.037 128.600 138.400 140.800 143.800 154.000 ▁▅▇▃▁
PSC Fill 23 0.991 0.195 0.118 0.000 0.100 0.180 0.260 0.620 ▆▇▃▁▁
Fill Pressure 22 0.991 47.922 3.178 34.600 46.000 46.400 50.000 60.400 ▁▁▇▂▁
Filler Level 20 0.992 109.252 15.698 55.800 98.300 118.400 120.000 161.200 ▁▃▅▇▁
Hyd Pressure2 15 0.994 20.961 16.386 0.000 0.000 28.600 34.600 59.400 ▇▂▇▅▁
Hyd Pressure3 15 0.994 20.458 15.976 -1.200 0.000 27.600 33.400 50.000 ▇▁▃▇▁
Temperature 14 0.995 65.968 1.383 63.600 65.200 65.600 66.400 76.200 ▇▃▁▁▁
Oxygen Filler 12 0.995 0.047 0.047 0.002 0.022 0.033 0.060 0.400 ▇▁▁▁▁
Pressure Setpoint 12 0.995 47.615 2.039 44.000 46.000 46.000 50.000 52.000 ▁▇▁▆▁
Hyd Pressure1 11 0.996 12.438 12.433 -0.800 0.000 11.400 20.200 58.000 ▇▅▂▁▁
Carb Volume 10 0.996 5.370 0.106 5.040 5.293 5.347 5.453 5.700 ▁▆▇▅▁
Carb Rel 10 0.996 5.437 0.129 4.960 5.340 5.400 5.540 6.060 ▁▇▇▂▁
Alch Rel 9 0.996 6.897 0.505 5.280 6.540 6.560 7.240 8.620 ▁▇▂▃▁
Usage cont 5 0.998 20.993 2.978 12.080 18.360 21.790 23.755 25.900 ▁▃▅▃▇
PH 4 0.998 8.546 0.173 7.880 8.440 8.540 8.680 9.360 ▁▅▇▂▁
Mnf Flow 2 0.999 24.569 119.481 -100.200 -100.000 65.200 140.800 229.400 ▇▁▁▇▂
Carb Flow 2 0.999 2468.354 1073.696 26.000 1144.000 3028.000 3186.000 5104.000 ▂▅▆▇▁
Bowl Setpoint 2 0.999 109.327 15.303 70.000 100.000 120.000 120.000 140.000 ▁▂▃▇▁
Density 1 1.000 1.174 0.378 0.240 0.900 0.980 1.620 1.920 ▁▅▇▂▆
Balling 1 1.000 2.198 0.931 -0.170 1.496 1.648 3.292 4.012 ▁▇▇▁▇
Balling Lvl 1 1.000 2.050 0.870 0.000 1.380 1.480 3.140 3.660 ▁▇▂▁▆
Pressure Vacuum 0 1.000 -5.216 0.570 -6.600 -5.600 -5.400 -5.000 -3.600 ▂▇▆▂▁
Air Pressurer 0 1.000 142.834 1.212 140.800 142.200 142.600 143.000 148.200 ▅▇▁▁▁

Next, we display the skimr output for the one character variable, Brand Code, below.

Character Variables in Beverage Data

skim_variable n_missing complete_rate min max empty n_unique whitespace
Brand Code 120 0.953 1 1 0 4 0

After loading the raw Excel file, we find the training dataset has 2571 observations with 32 predictors and 1 response variable PH.

General observations about the skimr results:

Our first transformation below is to add a unique identifier id based on row number to obtain a primary key.
Then we drop the 4 observations that are missing PH values and reposition the response variable PH and the BrandCode column for our convenience. Lastly, we eliminate spaces from the predictor column names to avoid needing backtick quoting in our code.

raw_StudentData %>% 
  rename_with(~ str_replace(., " ", ""),  everything() ) %>%
  mutate( id = row_number() )  %>% 
  relocate(id) %>%
  relocate(PH, .after = id) %>%
  relocate(`BrandCode`, .after = `BallingLvl` ) %>%
  filter( !is.na(PH)) %>% as_tibble() ->  sdata_v1

dim(sdata_v1)
## [1] 2567   34

Rows with Missing Data

Are there any observations with a high number of missing values?

# include any rows where any column is NA.  Always include id column
sdata_v1 %>% dplyr::filter_all( any_vars(is.na(.))) -> x1  

# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
x1 %>% select_if( ~ any(is.na(.) ) ) -> x2

# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
x2 %>% mutate( id = x1$id  ) %>% relocate(id)-> missing_train_rows
  
dim(missing_train_rows)
## [1] 529  28

We find 529 observations containing at least one NA value, i.e., 20.6% of the rows. We conclude that the NA values are sufficiently well distributed that removing all observations with NA values is impractical.

In the marginal table below, we investigate whether certain observations have a high incidence of NA values amongst their predictors. The table shows that the incidence of multiple NA values per observation falls off rapidly.

Number of Rows with NA

num_na count
1 345
2 119
3 45
4 13
5 3
6 2
7 1
8 1
## [1] "There are 812 cells with NA values out of 82144 equivalent to: 0.99%"

Overall, only 1 percent of cells have missing values.
The concentration of missing values seems to decline at a geometric ratio, suggesting independence among the occurrences of missing values.

Evaluation of Missing Data in the Test Set

## [1] 267  33
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
MFR 31 0.884 697.801 96.400 15.600 707.000 724.600 731.450 784.800 ▁▁▁▁▇
Filler Speed 10 0.963 3581.385 911.187 1006.000 3812.000 3978.000 3996.000 4020.000 ▁▁▁▁▇
Fill Ounces 6 0.978 23.970 0.076 23.747 23.920 23.967 24.013 24.200 ▁▅▇▃▁
PSC 5 0.981 0.085 0.053 0.004 0.044 0.076 0.112 0.246 ▆▇▃▂▁
PSC CO2 5 0.981 0.051 0.038 0.000 0.020 0.040 0.060 0.240 ▇▃▂▁▁
PC Volume 4 0.985 0.278 0.063 0.099 0.233 0.275 0.322 0.464 ▁▆▇▅▁
Carb Pressure1 4 0.985 123.040 4.420 113.000 120.200 123.400 125.500 136.000 ▃▃▇▂▁
Hyd Pressure4 4 0.985 97.840 13.921 68.000 90.000 98.000 104.000 140.000 ▅▆▇▂▁
PSC Fill 3 0.989 0.190 0.113 0.020 0.100 0.180 0.260 0.620 ▇▇▃▁▁
Oxygen Filler 3 0.989 0.047 0.050 0.002 0.020 0.034 0.054 0.398 ▇▁▁▁▁
Alch Rel 3 0.989 6.907 0.498 6.400 6.540 6.580 7.180 7.820 ▇▁▂▁▃
Fill Pressure 2 0.993 48.141 3.440 37.800 46.000 47.800 50.200 60.200 ▁▇▇▂▁
Filler Level 2 0.993 110.292 15.502 69.200 100.600 118.600 120.200 153.200 ▂▃▇▇▁
Temperature 2 0.993 66.227 1.692 63.800 65.400 65.800 66.600 75.400 ▇▅▁▁▁
Usage cont 2 0.993 20.897 2.999 12.900 18.120 21.440 23.740 24.600 ▁▃▃▃▇
Pressure Setpoint 2 0.993 47.733 2.057 44.000 46.000 46.000 50.000 52.000 ▁▇▁▆▁
Carb Rel 2 0.993 5.440 0.127 5.180 5.340 5.400 5.560 5.740 ▂▇▂▃▂
Carb Volume 1 0.996 5.369 0.111 5.147 5.287 5.340 5.465 5.667 ▂▇▃▅▁
Carb Temp 1 0.996 141.226 4.296 130.000 138.400 140.800 143.800 154.000 ▁▆▇▃▁
Hyd Pressure2 1 0.996 20.113 17.214 -50.000 0.000 26.800 34.800 61.400 ▁▁▆▇▁
Hyd Pressure3 1 0.996 19.609 16.557 -50.000 0.000 27.700 33.000 49.200 ▁▁▆▃▇
Density 1 0.996 1.177 0.382 0.060 0.920 0.980 1.600 1.840 ▁▁▇▁▅
Balling 1 0.996 2.203 0.925 0.902 1.498 1.648 3.242 3.788 ▅▇▁▂▅
Pressure Vacuum 1 0.996 -5.174 0.582 -6.400 -5.600 -5.200 -4.800 -3.600 ▁▇▆▃▁
Bowl Setpoint 1 0.996 109.624 15.017 70.000 100.000 120.000 120.000 130.000 ▁▂▁▃▇
Air Pressurer 1 0.996 142.831 1.231 141.200 142.200 142.600 142.800 147.200 ▅▇▁▁▁
Carb Pressure 0 1.000 68.255 3.857 60.200 65.300 68.000 70.600 77.600 ▃▆▇▃▂
Mnf Flow 0 1.000 21.034 117.756 -100.200 -100.000 0.200 141.300 220.400 ▇▁▁▆▂
Hyd Pressure1 0 1.000 12.010 13.531 -50.000 0.000 10.400 20.400 50.000 ▁▁▇▆▂
Carb Flow 0 1.000 2408.644 1161.364 0.000 1083.000 3038.000 3215.000 3858.000 ▂▃▁▆▇
Balling Lvl 0 1.000 2.051 0.883 0.000 1.380 1.480 3.080 3.420 ▁▃▇▁▇

Character Variables in Beverage Data

skim_variable n_missing complete_rate min max empty n_unique whitespace
Brand Code 8 0.97 1 1 0 4 0
## [1] 267  34
# include any rows where any column is NA.  Always include id column
#
#  NOTE THAT WE EXCLUDE PH because the response variable (by design) is NA
#  for all observations in the test set.
edata_v1 %>% select(-PH) %>% dplyr::filter_all( any_vars(is.na(.))) -> ex1  

# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
ex1 %>% select_if( ~ any(is.na(.) ) ) -> ex2

# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
ex2 %>% mutate( id = ex1$id  ) %>% relocate(id)-> missing_test_rows
  
dim(missing_test_rows)
## [1] 67 28

Number of Rows with NA

num_na count
1 50
2 11
3 3
4 1
8 1
14 1
## [1] "There are 107 cells with NA values (excluding PH) out of 8544 equivalent to: 1.25%"

We conclude that the distribution of missing values in the test set is similar to that of the training set: missing values are sparse (about 1.25% of cells) and occur mostly one per row.

Exploratory Data Analysis

Scatterplot of Predictor Variables

The scatterplots of the predictor variables against the dependent variable PH indicate no clear linear relationship with any predictor variable. The scatterplots indicate some variable measurements are not continuous and instead fall on specific values on the x-axis.

Density Plots

The density plots of each predictor variable show many predictor variables with non-normal distributions.

Normality Test

The Shapiro-Wilk test of normality is performed on all the numeric predictor variables; the results show no p-value above 0.05, and thus no predictor variable appears to follow a normal distribution.

## [1] 0
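The counting logic behind the `0` printed above can be sketched in base R (a minimal illustration; the helper name `count_normal` and the exact column selection are ours, not the report's):

```r
# Count numeric columns whose Shapiro-Wilk p-value exceeds 0.05,
# i.e., columns for which normality is NOT rejected.
count_normal <- function(df, alpha = 0.05) {
  numeric_cols <- df[sapply(df, is.numeric)]
  pvals <- sapply(numeric_cols, function(x) shapiro.test(na.omit(x))$p.value)
  sum(pvals > alpha)
}
```

A result of 0 means every numeric predictor fails the normality test.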

Distribution by Brand Code

The distribution of the Brand Code, the lone non-numeric predictor variable, shows a high count of brand B, comprising almost half of the samples. Also note, some values are missing for the Brand Code variable.

Boxplot of PH by Brand Code

The boxplot of PH values by Brand Code shows overlap of the interquartile ranges, which seems reasonable given the rather narrow PH range. Of note, the median values of the 4 labeled brands and the unlabeled group appear distinct, except for Brand A and the unlabeled group.

VSURF

To find significant variables for the models, the VSURF algorithm is applied to the provided dataset. The algorithm identifies 14 variables as significant to the predictions.

Data Partition into Training and Test Sets

Next we partition the student data set into training and test sets. We retain the term test set to refer to the validation data used for model performance assessment outside of the cross-validation process.

set.seed(19372)
train_indices = createDataPartition( sdata_v1$PH , times = 1, list = FALSE, p = 0.80 )

sdata_test    = sdata_v1[-train_indices , ]
sdata_train   = sdata_v1[ train_indices , ]
sdataY_test   = sdata_v1$PH[-train_indices ]
sdataY_train  = sdata_v1$PH[ train_indices ]

Drop Near Zero Variance Predictors

We conclude there are no near zero variance predictors or constant predictors to be dropped from the training data set.

No Near Zero Variance Predictors in Training

freqRatio percentUnique zeroVar nzv
id 1.00 100.00 FALSE FALSE
PH 1.05 2.48 FALSE FALSE
CarbVolume 1.03 4.72 FALSE FALSE
FillOunces 1.18 4.38 FALSE FALSE
PCVolume 1.19 21.22 FALSE FALSE
CarbPressure 1.10 5.06 FALSE FALSE
CarbTemp 1.04 5.99 FALSE FALSE
PSC 1.14 6.23 FALSE FALSE
PSCFill 1.11 1.56 FALSE FALSE
PSCCO2 1.08 0.63 FALSE FALSE
MnfFlow 1.08 21.75 FALSE FALSE
CarbPressure1 1.02 6.37 FALSE FALSE
FillPressure 1.78 5.06 FALSE FALSE
HydPressure1 31.86 11.58 FALSE FALSE
HydPressure2 7.63 9.39 FALSE FALSE
HydPressure3 12.71 8.91 FALSE FALSE
HydPressure4 1.02 1.90 FALSE FALSE
FillerLevel 1.05 12.90 FALSE FALSE
FillerSpeed 1.00 10.41 FALSE FALSE
Temperature 1.05 2.63 FALSE FALSE
Usagecont 1.12 22.43 FALSE FALSE
CarbFlow 1.33 23.21 FALSE FALSE
Density 1.19 3.65 FALSE FALSE
MFR 1.04 25.94 FALSE FALSE
Balling 1.24 9.34 FALSE FALSE
PressureVacuum 1.43 0.78 FALSE FALSE
OxygenFiller 1.32 15.67 FALSE FALSE
BowlSetpoint 2.80 0.49 FALSE FALSE
PressureSetpoint 1.33 0.39 FALSE FALSE
AirPressurer 1.03 1.41 FALSE FALSE
AlchRel 1.10 2.53 FALSE FALSE
CarbRel 1.02 1.95 FALSE FALSE
BallingLvl 1.23 3.89 FALSE FALSE
BrandCode 2.01 0.19 FALSE FALSE
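The freqRatio and percentUnique diagnostics above match the output of caret's `nearZeroVar(..., saveMetrics = TRUE)`. The two metrics can be sketched in base R as follows (cutoffs `freq_cut = 95/5` and `unique_cut = 10` are caret's defaults; a predictor is flagged only when both conditions hold):

```r
# Base-R sketch of the two near-zero-variance diagnostics reported above.
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  tab <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's frequency to the second most common.
  freq_ratio <- if (length(tab) > 1) as.numeric(tab[1] / tab[2]) else 0
  # Percentage of distinct values relative to the number of observations.
  pct_unique <- 100 * length(unique(x)) / length(x)
  c(freqRatio = freq_ratio,
    percentUnique = pct_unique,
    nzv = as.numeric(freq_ratio > freq_cut && pct_unique < unique_cut))
}
```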

Outliers and Zeros

Next we identify and fix zeros and outliers in the predictors before imputing missing values, because the KNN algorithm may be affected by outliers. The table below uses skim to report the minimum, maximum, median, mean and standard deviation for each predictor. In addition, we construct Z-score metrics that tell us how many standard deviations the minimum and maximum lie from the mean.

\[ \text{zscore\_min}(X_i) = \frac{ \min(X_i) - \mu(X_i) }{\sigma(X_i)}\] \[\text{zscore\_max}(X_i) = \frac{ \max(X_i) - \mu(X_i) }{\sigma(X_i)}\]

Predictors with Outliers by Z-Score

skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.hist zscore_min zscore_max
MFR 156 0.92 704.15 74.72 ▁▁▁▂▇ -9.00 2.20
OxygenFiller 7 1.00 0.05 0.04 ▇▁▁▁▁ -0.97 7.88
Temperature 8 1.00 65.95 1.36 ▇▃▁▁▁ -1.73 7.55
CarbRel 6 1.00 5.44 0.13 ▁▇▇▂▁ -3.70 4.84
AirPressurer 0 1.00 142.83 1.20 ▇▇▁▁▁ -1.52 4.48
PSCCO2 29 0.99 0.06 0.04 ▇▅▂▁▁ -1.31 4.31
FillPressure 15 0.99 47.91 3.14 ▁▁▇▅▁ -4.23 3.72
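The two Z-score metrics defined above can be computed per predictor with a small helper (a base-R sketch; NA values are dropped before computing the moments, and the function name is ours):

```r
# Z-scores of a predictor's extremes relative to its own mean and sd.
zscore_extremes <- function(x) {
  x  <- na.omit(x)
  mu <- mean(x)
  s  <- sd(x)
  c(zscore_min = (min(x) - mu) / s,
    zscore_max = (max(x) - mu) / s)
}
```

Applying it column-wise (e.g., via `sapply`) reproduces the two rightmost columns of the table.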

Using scatterplots of each of the 7 identified predictors below, we see no obvious data errors in the distributions of the predictors with outlier values.
Isolated points exist but lie within reasonable proximity to neighboring points in the scatterplots. Therefore no data corrections for outliers will be applied to the training data.

Handle Special Cases

The code below will transform the special case situations of data pre-processing identified earlier:

These rules are applied consistently to the training, validation and test data sets prior to subsequent pre-processing.

KNN Imputation

We apply the caret pre-processing function to the training, validation and test data sets.

# Build the caret function to preprocess the Chemical data and impute missing values.
# There is a bug in caret which causes tibbles to be rejected.  They need to be cast as data frames.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(as.data.frame(sdata_v2_train[,3:33]) , method = c("BoxCox", "center", "scale",  "knnImpute") )

# Becomes the source data for the model building
sdata_train_pp = predict( preProcFunc,  as.data.frame(sdata_v2_train ) )

# Becomes the final version of test data for validation
sdata_test_pp  = predict( preProcFunc,  as.data.frame(sdata_v2_test) )

# Need to generate predictions based on test data without known response
edata_test_pp  = predict( preProcFunc,  as.data.frame(edata_v2) )


sdata_trainX_pp  = sdata_train_pp %>% select( -PH)
sdata_testX_pp   = sdata_test_pp %>% select(-PH)
edata_testX_pp   = edata_test_pp %>% select( -PH)
Data summary
Name sdata_train_pp
Number of rows 2055
Number of columns 34
_______________________
Column type frequency:
character 1
numeric 33
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
BrandCode 0 1 1 1 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 1297.57 751.06 1.00 644.50 1296.00 1961.00 2568.00 ▇▇▇▇▇
PH 0 1 8.55 0.17 7.88 8.44 8.54 8.68 9.36 ▁▅▇▁▁
CarbVolume 0 1 0.00 1.00 -3.42 -0.71 -0.26 0.86 2.90 ▁▃▇▅▁
FillOunces 0 1 0.00 0.99 -3.88 -0.62 -0.01 0.60 3.91 ▁▂▇▂▁
PCVolume 0 1 0.01 1.00 -3.72 -0.60 -0.07 0.59 3.01 ▁▂▇▃▁
CarbPressure 0 1 0.01 1.00 -3.11 -0.72 0.04 0.70 2.88 ▁▃▇▅▁
CarbTemp 0 1 0.01 1.00 -3.47 -0.65 0.01 0.69 2.90 ▁▂▇▅▁
PSC 0 1 0.00 0.99 -2.75 -0.65 0.01 0.66 2.83 ▁▅▇▅▁
PSCFill 0 1 0.00 1.00 -1.66 -0.81 -0.13 0.54 3.60 ▆▇▃▁▁
PSCCO2 0 1 0.00 0.99 -1.31 -0.84 -0.37 0.33 4.31 ▇▅▂▁▁
MnfFlow 0 1 0.00 1.00 -1.05 -1.05 0.50 0.97 1.58 ▇▁▁▆▂
CarbPressure1 0 1 -0.01 1.00 -3.63 -0.77 0.13 0.60 3.77 ▁▃▇▂▁
FillPressure 0 1 0.00 1.00 -5.38 -0.58 -0.45 0.70 3.23 ▁▁▇▆▁
HydPressure1 0 1 0.00 1.00 -1.07 -1.01 -0.08 0.65 3.66 ▇▅▂▁▁
HydPressure2 0 1 0.00 1.00 -1.29 -1.29 0.47 0.83 2.34 ▇▂▇▅▁
HydPressure3 0 1 0.00 1.00 -1.37 -1.29 0.44 0.80 1.84 ▇▁▃▇▁
HydPressure4 0 1 0.01 1.00 -3.40 -0.76 0.07 0.52 2.82 ▁▅▇▇▁
FillerLevel 0 1 0.00 1.00 -2.85 -0.72 0.54 0.68 4.30 ▁▅▇▁▁
FillerSpeed 0 1 -0.05 1.05 -3.44 0.20 0.41 0.44 0.51 ▁▁▁▁▇
Temperature 0 1 0.00 1.00 -1.91 -0.57 -0.25 0.38 6.50 ▇▆▁▁▁
Usagecont 0 1 0.00 1.00 -2.46 -0.92 0.20 0.95 1.83 ▁▇▅▇▆
CarbFlow 0 1 0.00 1.00 -2.02 -1.33 0.52 0.69 2.99 ▃▁▇▁▁
Density 0 1 0.00 1.00 -4.85 -0.68 -0.41 1.18 1.71 ▁▁▁▇▅
MFR 0 1 0.00 1.00 -6.70 -0.02 0.29 0.41 3.35 ▁▁▁▇▁
Balling 0 1 0.00 1.00 -6.22 -0.74 -0.50 1.19 1.68 ▁▁▁▇▅
PressureVacuum 0 1 0.00 1.00 -2.42 -0.67 -0.31 0.39 2.84 ▂▇▅▃▂
OxygenFiller 0 1 0.00 1.00 -1.98 -0.35 0.03 0.62 3.21 ▃▇▆▂▁
BowlSetpoint 0 1 0.00 1.00 -2.39 -0.73 0.70 0.70 2.40 ▁▃▃▇▁
PressureSetpoint 0 1 0.00 1.00 -1.94 -0.77 -0.77 1.16 1.96 ▁▇▁▆▁
AirPressurer 0 1 0.00 1.00 -1.58 -0.53 -0.18 0.16 4.37 ▅▇▁▁▁
AlchRel 0 1 0.00 1.00 -5.09 -0.72 -0.62 0.79 2.76 ▁▁▇▂▃
CarbRel 0 1 0.00 1.00 -4.28 -0.75 -0.26 0.97 4.21 ▁▂▇▅▁
BallingLvl 0 1 0.00 1.00 -2.36 -0.77 -0.66 1.26 1.86 ▁▇▂▁▆
Data summary
Name sdata_test_pp
Number of rows 512
Number of columns 34
_______________________
Column type frequency:
character 1
numeric 33
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
BrandCode 0 1 1 1 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 1237.47 705.94 5.00 638.00 1248.50 1803.75 2571.00 ▇▇▇▇▆
PH 0 1 8.55 0.17 7.98 8.44 8.54 8.68 8.94 ▁▃▇▇▃
CarbVolume 0 1 0.01 1.01 -3.12 -0.71 -0.19 0.80 2.52 ▁▃▇▅▂
FillOunces 0 1 0.03 1.01 -3.20 -0.55 -0.01 0.60 3.99 ▁▅▇▂▁
PCVolume 0 1 -0.01 0.93 -3.47 -0.55 -0.04 0.58 2.79 ▁▂▇▅▁
CarbPressure 0 1 0.00 0.99 -3.54 -0.72 -0.02 0.71 2.74 ▁▂▇▇▁
CarbTemp 0 1 -0.02 1.00 -3.47 -0.65 -0.09 0.64 2.90 ▁▂▇▅▁
PSC 0 1 -0.05 1.06 -2.53 -0.81 -0.08 0.67 2.76 ▂▇▇▅▂
PSCFill 0 1 -0.02 0.99 -1.66 -0.81 -0.13 0.54 3.43 ▆▇▃▂▁
PSCCO2 0 1 0.07 1.03 -1.31 -0.84 -0.37 0.56 4.31 ▇▅▂▁▁
MnfFlow 0 1 -0.06 0.98 -1.05 -1.05 -0.22 0.92 1.69 ▇▁▁▇▁
CarbPressure1 0 1 -0.02 1.06 -3.46 -0.90 0.09 0.65 3.34 ▁▅▇▅▁
FillPressure 0 1 0.02 1.04 -3.82 -0.58 -0.45 0.70 3.41 ▁▁▇▅▁
HydPressure1 0 1 -0.03 1.01 -1.07 -1.01 -0.12 0.57 3.21 ▇▅▂▁▁
HydPressure2 0 1 -0.04 1.00 -1.29 -1.29 0.42 0.81 2.08 ▇▁▆▆▁
HydPressure3 0 1 -0.05 1.00 -1.37 -1.29 0.40 0.79 1.81 ▇▁▃▇▁
HydPressure4 0 1 0.07 0.99 -2.16 -0.58 0.23 0.52 2.82 ▂▃▇▂▁
FillerLevel 0 1 -0.07 1.05 -2.48 -1.22 0.57 0.69 3.08 ▂▃▇▁▁
FillerSpeed 0 1 -0.16 1.18 -3.43 0.05 0.35 0.44 0.48 ▁▁▁▁▇
Temperature 0 1 0.06 1.05 -1.91 -0.57 -0.10 0.38 6.30 ▆▇▁▁▁
Usagecont 0 1 -0.05 1.01 -2.54 -1.04 0.19 0.94 1.78 ▁▇▅▇▇
CarbFlow 0 1 0.08 1.00 -2.02 -1.25 0.54 0.74 1.43 ▃▁▁▇▅
Density 0 1 0.02 1.00 -4.14 -0.68 -0.41 1.18 1.68 ▁▁▆▇▇
MFR 0 1 -0.01 0.97 -6.40 -0.04 0.29 0.39 2.91 ▁▁▁▇▁
Balling 0 1 -0.01 1.01 -3.70 -0.82 -0.50 1.19 1.63 ▁▁▇▂▆
PressureVacuum 0 1 0.04 1.00 -2.42 -0.67 0.04 0.39 2.84 ▂▇▆▃▂
OxygenFiller 0 1 0.09 0.98 -1.98 -0.27 0.08 0.69 3.21 ▃▇▇▂▁
BowlSetpoint 0 1 -0.09 1.08 -2.39 -1.35 0.70 0.70 2.40 ▂▃▂▇▁
PressureSetpoint 0 1 0.02 0.98 -1.94 -0.77 -0.77 1.16 1.28 ▁▇▁▁▆
AirPressurer 0 1 0.04 1.05 -1.76 -0.53 -0.18 0.16 3.91 ▂▇▁▁▁
AlchRel 0 1 0.00 0.99 -1.08 -0.72 -0.67 0.68 2.74 ▇▁▂▃▁
CarbRel 0 1 -0.01 1.00 -3.67 -0.75 -0.26 0.83 3.02 ▁▂▇▅▁
BallingLvl 0 1 0.01 1.01 -2.36 -0.77 -0.68 1.26 1.65 ▁▆▇▁▇
Data summary
Name edata_test_pp
Number of rows 267
Number of columns 34
_______________________
Column type frequency:
character 1
logical 1
numeric 32
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
BrandCode 0 1 1 1 0 5 0

Variable type: logical

skim_variable n_missing complete_rate mean count
PH 267 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 134.00 77.22 1.00 67.50 134.00 200.50 267.00 ▇▇▇▇▇
CarbVolume 0 1 -0.02 1.04 -2.23 -0.81 -0.26 0.89 2.63 ▂▇▅▅▁
FillOunces 0 1 -0.04 0.86 -2.60 -0.62 -0.09 0.45 2.60 ▁▅▇▃▁
PCVolume 0 1 0.01 1.04 -3.28 -0.70 0.01 0.75 2.82 ▁▃▇▇▁
CarbPressure 0 1 0.01 1.08 -2.42 -0.80 -0.02 0.70 2.46 ▂▆▇▅▂
CarbTemp 0 1 0.03 1.06 -3.04 -0.65 -0.04 0.71 2.90 ▁▃▇▃▁
PSC 0 1 -0.01 1.08 -2.53 -0.75 0.01 0.66 2.55 ▂▅▇▃▂
PSCFill 0 1 -0.05 0.96 -1.49 -0.73 -0.13 0.46 3.60 ▆▇▃▁▁
PSCCO2 0 1 -0.12 0.89 -1.31 -0.84 -0.37 0.09 4.31 ▇▃▂▁▁
MnfFlow 0 1 -0.04 0.98 -1.05 -1.05 -0.22 0.96 1.62 ▇▁▁▆▂
CarbPressure1 0 1 0.09 0.94 -2.05 -0.53 0.17 0.60 2.87 ▂▃▇▂▁
FillPressure 0 1 0.06 1.10 -3.82 -0.58 0.01 0.76 3.37 ▁▁▇▆▁
HydPressure1 0 1 -0.04 1.09 -5.04 -1.01 -0.17 0.63 3.02 ▁▁▇▆▂
HydPressure2 0 1 -0.07 1.05 -4.34 -1.29 0.35 0.83 2.46 ▁▁▆▇▁
HydPressure3 0 1 -0.07 1.04 -4.42 -1.29 0.44 0.77 1.79 ▁▁▆▃▇
HydPressure4 0 1 0.13 1.05 -2.63 -0.41 0.23 0.66 2.73 ▁▂▇▃▂
FillerLevel 0 1 0.06 1.01 -2.32 -0.66 0.56 0.69 3.51 ▃▃▇▁▁
FillerSpeed 0 1 -0.23 1.25 -3.43 0.04 0.28 0.44 0.48 ▁▁▁▁▇
Temperature 0 1 0.21 1.20 -1.74 -0.40 -0.09 0.53 6.09 ▇▇▂▁▁
Usagecont 0 1 -0.05 1.01 -2.37 -1.02 0.06 0.94 1.29 ▁▅▃▂▇
CarbFlow 0 1 -0.03 1.07 -2.02 -1.36 0.53 0.73 1.46 ▅▁▁▇▃
Density 0 1 -0.02 1.14 -9.22 -0.64 -0.41 1.14 1.58 ▁▁▁▃▇
MFR 0 1 -0.05 1.16 -6.71 -0.01 0.29 0.41 1.50 ▁▁▁▁▇
Balling 0 1 0.00 1.00 -1.98 -0.74 -0.50 1.16 1.54 ▁▇▃▁▇
PressureVacuum 0 1 0.08 1.02 -2.07 -0.67 0.04 0.74 2.84 ▁▇▆▃▁
OxygenFiller 0 1 0.02 0.99 -1.98 -0.41 0.05 0.55 3.20 ▅▇▇▂▁
BowlSetpoint 0 1 0.00 0.99 -2.39 -0.73 0.70 0.70 1.52 ▂▂▃▇▁
PressureSetpoint 0 1 0.06 1.00 -1.94 -0.77 -0.77 1.16 1.96 ▁▇▁▆▁
AirPressurer 0 1 0.01 1.03 -1.40 -0.53 -0.18 -0.01 3.60 ▅▇▁▁▁
AlchRel 0 1 0.01 0.98 -1.08 -0.72 -0.62 0.68 1.74 ▇▁▁▁▃
CarbRel 0 1 0.02 0.99 -2.14 -0.75 -0.26 0.97 2.24 ▁▇▃▃▃
BallingLvl 0 1 0.00 1.02 -2.36 -0.77 -0.66 1.19 1.58 ▁▃▇▁▇

Data visualization

First, we consider a correlation matrix where pairwise complete observations are allowed for use in computing correlations. Due to the low percentage of incomplete cells in the dataset, we prefer this to dropping observations with any incomplete columns.

The correlation matrix below uses hierarchical clustering of the predictors to form 6 groups of variables using the corrplot package. This was the most efficient way to gain insight into the related clusters of variables.

M = cor(sdata_v1[,2:33], use = "pairwise.complete.obs")

corrplot::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 ,  tl.cex = 0.7 )

We compare the correlations to the cleaned pre-processed training data.

Mpp = cor(sdata_train_pp[,2:33], use = "pairwise.complete.obs")
corrplot::corrplot(Mpp, method = "ellipse", order = "hclust", addrect = 6 ,  tl.cex = 0.7 )

Exporting the Post-Processed Data Sets

Our last step of preprocessing is to export the data sets. This need only be done once.

readr::write_csv(sdata_train_pp, "sdata_train_pp.csv", append = FALSE )

readr::write_csv(sdata_test_pp, "sdata_test_pp.csv", append = FALSE )

readr::write_csv(edata_test_pp, "edata_test_pp.csv", append = FALSE )

The above data files have the following properties and relationships to the original raw files:

Pre-Processed Data Files

filename primary_key response numeric_columns num_rows file_source
sdata_train_pp.csv id - consistent with sdata_test_pp PH - unchanged BoxCox, center, scaled, imputed 2055 StudentData.xlsx
sdata_test_pp.csv id - consistent with sdata_train_pp PH - unchanged BoxCox, center, scaled, imputed 512 StudentData.xlsx
edata_test_pp.csv id - standalone PH - unchanged BoxCox, center, scaled, imputed 267 StudentEvaluation.xlsx

An important note about the id column: the id in the file edata_test_pp.csv is the row number of the same observation in StudentEvaluation.xlsx. The same is not true for sdata_train_pp.csv and sdata_test_pp.csv because they are sampled from the source file StudentData.xlsx. The 10th row of sdata_test_pp.csv is not necessarily the 10th row of StudentData.xlsx; rather, one should use the id field as the corresponding row number in the original file.
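A toy illustration of this point (the PH values below are hypothetical, not model output): because each sampled row carries its source row number in `id`, merging on `id` maps predictions back to the original StudentData.xlsx rows regardless of row order in the exported file.

```r
# `original` stands in for the source file; `validation` for a sampled export.
original   <- data.frame(id = 1:5, source_row = 1:5)
validation <- data.frame(id = c(5, 2), PH_pred = c(8.51, 8.62))  # hypothetical

# merge() on id recovers the correct source rows, ignoring file row order.
mapped <- merge(validation, original, by = "id")
```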

Loading and using the pre-processed data:

The csv files listed above can be loaded and used for model building with no further transformation. Since they contain a primary key, the response variable, and the predictors, they have the necessary information to fit a model and generate predictions on training, validation and test samples.

We load the csv file for the training data set below and display the first few column characteristics from the file.

sdata_train_pp = read_csv("sdata_train_pp.csv", col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>% as_tibble()

skim(sdata_train_pp) %>% yank("numeric") %>% head(n=5)

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 1297.57 751.06 1.00 644.50 1296.00 1961.00 2568.00 ▇▇▇▇▇
PH 0 1 8.55 0.17 7.88 8.44 8.54 8.68 9.36 ▁▅▇▁▁
CarbVolume 0 1 0.00 1.00 -3.42 -0.71 -0.26 0.86 2.90 ▁▃▇▅▁
FillOunces 0 1 0.00 0.99 -3.88 -0.62 -0.01 0.60 3.91 ▁▂▇▂▁
PCVolume 0 1 0.01 1.00 -3.72 -0.60 -0.07 0.59 3.01 ▁▂▇▃▁
skim(sdata_train_pp) %>% yank("character") 
## NULL

Now we load the validation data set (20% of the original training observations) below and inspect its initial columns.

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 1237.47 705.94 5.00 638.00 1248.50 1803.75 2571.00 ▇▇▇▇▆
PH 0 1 8.55 0.17 7.98 8.44 8.54 8.68 8.94 ▁▃▇▇▃
CarbVolume 0 1 0.01 1.01 -3.12 -0.71 -0.19 0.80 2.52 ▁▃▇▅▂
FillOunces 0 1 0.03 1.01 -3.20 -0.55 -0.01 0.60 3.99 ▁▅▇▂▁
PCVolume 0 1 -0.01 0.93 -3.47 -0.55 -0.04 0.58 2.79 ▁▂▇▅▁

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
BrandCode 0 1 1 1 0 5 0

Now we load the test data set below and inspect its initial columns.

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 134.00 77.22 1.00 67.50 134.00 200.50 267.00 ▇▇▇▇▇
CarbVolume 0 1 -0.02 1.04 -2.23 -0.81 -0.26 0.89 2.63 ▂▇▅▅▁
FillOunces 0 1 -0.04 0.86 -2.60 -0.62 -0.09 0.45 2.60 ▁▅▇▃▁
PCVolume 0 1 0.01 1.04 -3.28 -0.70 0.01 0.75 2.82 ▁▃▇▇▁
CarbPressure 0 1 0.01 1.08 -2.42 -0.80 -0.02 0.70 2.46 ▂▆▇▅▂

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
BrandCode 0 1 1 1 0 5 0

Model Building

We demonstrate building a linear regression model using the pre-processed training data set.
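A minimal self-contained sketch of such a fit is shown below. The data, variable names and coefficients here are synthetic and purely illustrative; the actual model would be fit on the exported `sdata_train_pp.csv` (e.g., `lm(PH ~ . - id, data = sdata_train_pp)` or caret's `train`) and assessed on `sdata_test_pp`.

```r
# Illustrative linear regression on synthetic, centered/scaled-style predictors.
set.seed(19372)
n   <- 200
toy <- data.frame(MnfFlow = rnorm(n), Temperature = rnorm(n))
toy$PH <- 8.55 + 0.10 * toy$MnfFlow - 0.05 * toy$Temperature +
  rnorm(n, sd = 0.02)                       # small noise around a linear signal

fit  <- lm(PH ~ MnfFlow + Temperature, data = toy)
rmse <- sqrt(mean(residuals(fit)^2))        # in-sample RMSE
```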

Code

We summarize all the R code used in this project in this appendix for ease of reading.

knitr::opts_chunk$set(echo = FALSE)

# Conditional Code Evaluation is managed here.

library(knitr)
library(tidyverse)
library(kableExtra)
library(cowplot)
library(skimr)
library(GGally) # for ggpairs
library(readxl) # to parse Excel workbooks
library(openxlsx) # for writing xlsx files
library(corrplot)
library(RColorBrewer)
library(caret)
library(VSURF)
library(tictoc)
tic()
.code-bg {
  background-color: lightgray;
}

raw_StudentData <- read_excel("StudentData.xlsx")
dim(raw_StudentData)
skim(raw_StudentData) %>% 
  yank("numeric") %>% 
  arrange( complete_rate) %>% 
  kable( digits = 3 ) %>% 
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

skim(raw_StudentData) %>% yank("character") %>%
  kable(  caption = "Character Variables in Beverage Data" , digits = 3) %>% 
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

raw_StudentData %>% 
  rename_with(~ str_replace(., " ", ""),  everything() ) %>%
  mutate( id = row_number() )  %>% 
  relocate(id) %>%
  relocate(PH, .after = id) %>%
  relocate(`BrandCode`, .after = `BallingLvl` ) %>%
  filter( !is.na(PH)) %>% as_tibble() ->  sdata_v1

dim(sdata_v1)
# include any rows where any column is NA.  Always include id column
sdata_v1 %>% dplyr::filter_all( any_vars(is.na(.))) -> x1  

# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
x1 %>% select_if( ~ any(is.na(.) ) ) -> x2

# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
x2 %>% mutate( id = x1$id  ) %>% relocate(id)-> missing_train_rows
  
dim(missing_train_rows)
missing_train_rows %>% mutate( num_na = rowSums(is.na(.))) -> missing_train_summary

missing_train_summary %>% group_by(num_na) %>%
  summarize( count = n()) %>%
  kable( caption = "Number of Rows with NA") %>%
  kable_styling(position = "left", bootstrap_options = c("hover", "striped"))
num_na_cells = sum( missing_train_summary$num_na)
num_cells = nrow(sdata_v1) * 32  # excludes the response variable PH since we dropped its values.

print(paste0("There are ", num_na_cells , " cells with NA values out of ", num_cells, " equivalent to: " , sprintf("%.2f%%", 100 * num_na_cells / num_cells ) ) )
raw_StudentEvaluation <- read_excel("StudentEvaluation.xlsx")
dim(raw_StudentEvaluation)
skim(raw_StudentEvaluation) %>% 
  yank("numeric") %>% 
  arrange( complete_rate) %>% 
  kable( digits = 3 ) %>% 
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

skim(raw_StudentEvaluation) %>% yank("character") %>%
  kable(  caption = "Character Variables in Beverage Data" , digits = 3) %>% 
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
raw_StudentEvaluation %>% 
  rename_with(~ str_replace(., " ", ""),  everything() ) %>%
  mutate( id = row_number() )  %>% 
  relocate(id) %>%
  relocate(PH, .after = id) %>%
  relocate(`BrandCode`, .after = `BallingLvl` )  %>% as_tibble() ->  edata_v1

dim(edata_v1)
# include any rows where any column is NA.  Always include id column
#
#  NOTE THAT WE EXCLUDE PH because the response variable (by design) is NA
#  for all observations in the test set.
edata_v1 %>% select(-PH) %>% dplyr::filter_all( any_vars(is.na(.))) -> ex1  

# keep only the columns where at least one row is NA in that column;
# the fully populated id column drops out, so ex2 has no primary key
ex1 %>% select_if( ~ any(is.na(.) ) ) -> ex2

# The rows of ex1 and ex2 are in the same order,
# so add the id back and move it to the left side of the tibble
ex2 %>% mutate( id = ex1$id  ) %>% relocate(id)-> missing_test_rows
  
dim(missing_test_rows)
missing_test_rows %>% mutate( num_na = rowSums(is.na(.))) -> missing_test_summary

missing_test_summary %>% group_by(num_na) %>%
  summarize( count = n()) %>%
  kable( caption = "Number of Rows with NA") %>%
  kable_styling(position = "left", bootstrap_options = c("hover", "striped"))
num_na_cells_test = sum( missing_test_summary$num_na)
num_cells_test = nrow(edata_v1) * 32  # 32 predictor columns; excludes id and PH (NA by design in the evaluation set)

print(paste0("There are ", num_na_cells_test , " cells with NA values (excluding PH) out of ", num_cells_test, " equivalent to: " , 
             sprintf("%.2f%%", 100 * num_na_cells_test / num_cells_test ) ) )


skim_sdata = skim(sdata_v1)
skim_edata = skim(edata_v1)

skim_sdata %>% 
  select( skim_variable, n_missing, complete_rate) %>% 
  inner_join( skim_edata %>% select(skim_variable, n_missing, complete_rate), 
              by = c("skim_variable") ) %>% 
  filter( skim_variable != 'PH' ) -> skim_complete_rates 

skim_complete_rates %>% rename( training_complete = complete_rate.x, testing_complete = complete_rate.y) -> skim_complete_rates
skim_complete_rates %>% as_tibble() %>%  ggplot(aes(x=training_complete, y = testing_complete)) + 
  geom_point() + geom_smooth(method = "lm") + theme(aspect.ratio = 1) + lims( x = c(0.95, 1), y = c(0.95, 1)) +
  labs(title = "Completion Rate by Predictor between Training and Test Sets")

# Feature plot for the numeric predictor variables against the result variable PH
cols <- sdata_v1 %>%
  select(-c('id', 'BrandCode', 'PH')) %>% colnames()

featurePlot(sdata_v1[,cols], 
            sdata_v1$PH, 
            plot="scatter",
            type = c("p", "smooth"),
            span = .5,
            layout=c(6,6))
#https://statisticsglobe.com/histogram-density-for-each-column-data-frame-r
data_long <- sdata_v1[,cols] %>%
  pivot_longer(cols) %>%
  as.data.frame()

ggp2 <- ggplot(data_long, aes(x = value)) +
  geom_density() +
  facet_wrap(~ name, scales="free") +
  labs(title = "Density Plot of Predictor Variables")
ggp2
# http://www.sthda.com/english/wiki/normality-test-in-r
# If the p-value > 0.05, the distribution of the data is not significantly different from a normal distribution; in other words, we can assume normality.
shap_test_res <- lapply(sdata_v1[,cols], shapiro.test)
# Result: None of the numeric variables are normally distributed.

# https://stackoverflow.com/questions/62306712/how-to-select-only-the-p-value-0-05-after-performing-shapiro-wilk-test-in-rstud
subset_vector  <- sapply(shap_test_res, function(x) x$p.value > .05)
results_subset <- shap_test_res[subset_vector]

length(results_subset)
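The p-value filtering above can be illustrated on synthetic data (a minimal base-R sketch; the column names are made up):

```r
# Run a Shapiro-Wilk test per column, then keep only the columns whose
# p-value exceeds 0.05 (i.e. those we cannot reject as normal).
set.seed(1)
toy <- data.frame(normal_col = rnorm(100),  # drawn from a normal
                  skewed_col = rexp(100))   # heavily right-skewed
res  <- lapply(toy, shapiro.test)
keep <- sapply(res, function(x) x$p.value > 0.05)
names(res[keep])
```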
ggplot(data = sdata_v1) +
  geom_bar(mapping = aes(x = `BrandCode`)) +
  labs(title = "Distribution of Brand Code")
ggplot(data = sdata_v1, mapping = aes(x = `BrandCode`, y = PH)) +
  geom_boxplot() +
  labs(title = "Boxplot of Brand Code")
sdata_v1_no_nas <- sdata_v1 %>% drop_na()

bev.vsurf <- VSURF(sdata_v1_no_nas[,cols], 
                   sdata_v1_no_nas$PH,
                   ntree = 10,
                   nfor.thres = 20,
                   nfor.interp = 10, nfor.pred = 10)
bev.vsurf
summary(bev.vsurf)
bev.vsurf$varselect.pred
colnames(sdata_v1_no_nas[,cols])  # map the selected indices back to predictor names
set.seed(19372)
# Stratified 80/20 split on PH; the 20% holdout (named sdata_test below) is our
# validation set, distinct from the external evaluation set in StudentEvaluation.xlsx
train_indices = createDataPartition( sdata_v1$PH , times = 1, list = FALSE, p = 0.80 )

sdata_test    = sdata_v1[-train_indices , ]
sdata_train   = sdata_v1[ train_indices , ]
sdataY_test   = sdata_v1$PH[-train_indices ]
sdataY_train  = sdata_v1$PH[ train_indices ]
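For a numeric response, `createDataPartition` stratifies by binning the response into quantile groups and sampling within each bin (controlled by its `groups` argument). A base-R sketch of that idea, using five illustrative bins:

```r
# Bin a numeric response into quintiles, then sample 80% within each bin so
# the holdout preserves the response distribution.
set.seed(19372)
y    <- runif(100)
bins <- cut(y, quantile(y, probs = seq(0, 1, 0.2)), include.lowest = TRUE)
idx  <- unlist(lapply(split(seq_along(y), bins),
                      function(i) sample(i, round(0.8 * length(i)))))
length(idx)
```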

nearZeroVar(sdata_train, saveMetrics = TRUE) %>% 
  kable(caption = "No Near Zero Variance Predictors in Training", digits = 2 ) %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

skim(sdata_train) %>% 
 mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
          zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
          zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
          )  %>%
  filter( zscore_extreme > 4 , skim_variable != 'PH') %>%
  arrange( desc(zscore_extreme)) %>%
  dplyr::select(skim_variable, n_missing, complete_rate, numeric.mean, numeric.sd, numeric.hist, zscore_min, zscore_max ) %>%
  kable(digits = 2, caption = "Predictors with Outliers by Z-Score" ) %>%
  kable_styling(bootstrap_options = c("hover" , "striped"), position = "left")
pl1 = sdata_train %>% ggplot(aes(x=PH, y = MFR)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl2 = sdata_train %>% ggplot(aes(x=PH, y = OxygenFiller)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl3 = sdata_train %>% ggplot(aes(x=PH, y = Temperature)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl4 = sdata_train %>% ggplot(aes(x=PH, y = CarbRel)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl5 = sdata_train %>% ggplot(aes(x=PH, y = AirPressurer)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl6 = sdata_train %>% ggplot(aes(x=PH, y = PSCCO2)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl7 = sdata_train %>% ggplot(aes(x=PH, y = FillPressure)) + geom_point(size=0.3) + theme(aspect.ratio = 1)

plotlist= list(pl1, pl2, pl3, pl4, pl5, pl6, pl7)
plot_grid(plotlist=plotlist, nrow = 2 )
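Step 7 of our wrangling strategy calls for fixing outliers. One option, sketched below in base R (a hypothetical helper, not yet applied to the data), is to winsorize a predictor at the same ±4 z-score threshold used in the outlier table above:

```r
# Cap values at mean +/- z standard deviations rather than dropping rows.
cap_at_z <- function(x, z = 4) {
  mu <- mean(x, na.rm = TRUE)
  s  <- sd(x, na.rm = TRUE)
  pmin(pmax(x, mu - z * s), mu + z * s)
}
v      <- c(rep(10, 99), 1000)  # one extreme point
capped <- cap_at_z(v)           # the extreme point is pulled in toward the mean
```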


median_MFR = median( sdata_train$MFR  , na.rm = TRUE )  # training median, reused for validation and evaluation sets

sdata_train %>% 
  mutate( MFR = ifelse( is.na(MFR), median_MFR, MFR)) %>%
  mutate( BrandCode = ifelse( is.na(BrandCode), "E", BrandCode))  -> sdata_v2_train

sdata_test  %>% 
  mutate( MFR = ifelse( is.na(MFR), median_MFR, MFR)) %>%
  mutate( BrandCode = ifelse( is.na(BrandCode), "E", BrandCode))  -> sdata_v2_test

edata_v1 %>%
  mutate( MFR = ifelse( is.na(MFR), median_MFR, MFR)) %>%
  mutate( BrandCode = ifelse( is.na(BrandCode), "E", BrandCode)) ->  edata_v2



# Build the caret preProcess function on the training predictors and impute missing values.
# caret's preProcess can mishandle tibbles, so we cast inputs to data frames first.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(as.data.frame(sdata_v2_train[,3:33]) , method = c("BoxCox", "center", "scale",  "knnImpute") )

# Becomes the source data for the model building
sdata_train_pp = predict( preProcFunc,  as.data.frame(sdata_v2_train ) )

# Becomes the final version of test data for validation
sdata_test_pp  = predict( preProcFunc,  as.data.frame(sdata_v2_test) )

# Need to generate predictions based on test data without known response
edata_test_pp  = predict( preProcFunc,  as.data.frame(edata_v2) )


sdata_trainX_pp  = sdata_train_pp %>% select( -PH)
sdata_testX_pp   = sdata_test_pp %>% select(-PH)
edata_testX_pp   = edata_test_pp %>% select( -PH)

skim(sdata_train_pp)
skim(sdata_test_pp)
skim(edata_test_pp)
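A key point in the preprocessing above is that `preProcFunc` is fit on the training set only and then applied unchanged to the validation and evaluation sets. The centering/scaling part of that contract can be sketched in base R (the numbers are illustrative):

```r
# Center/scale new data with the *training* mean and sd; never refit on
# validation or evaluation data.
train_x <- c(2, 4, 6, 8)
test_x  <- c(3, 5)
mu <- mean(train_x)
s  <- sd(train_x)
scaled_train <- (train_x - mu) / s
scaled_test  <- (test_x  - mu) / s   # training statistics reused
```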
M = cor(sdata_v1[,2:33], use = "pairwise.complete.obs")

corrplot::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 ,  tl.cex = 0.7 )
Mpp = cor(sdata_train_pp[,2:33], use = "pairwise.complete.obs")
corrplot::corrplot(Mpp, method = "ellipse", order = "hclust", addrect = 6 ,  tl.cex = 0.7 )


readr::write_csv(sdata_train_pp, "sdata_train_pp.csv", append = FALSE )

readr::write_csv(sdata_test_pp, "sdata_test_pp.csv", append = FALSE )

readr::write_csv(edata_test_pp, "edata_test_pp.csv", append = FALSE )


file_df = tibble( filename = c("sdata_train_pp.csv", "sdata_test_pp.csv", "edata_test_pp.csv"),
                  primary_key = c("id - consistent with sdata_test_pp", "id - consistent with sdata_train_pp", "id - standalone") ,
                  response = c("PH - unchanged", "PH - unchanged", "PH - unchanged" ) ,
                  
                  numeric_columns = c("BoxCox, center, scaled, imputed","BoxCox, center, scaled, imputed","BoxCox, center, scaled, imputed") ,
                  num_rows = c(2055, 512, 267 ) ,
                  file_source = c("StudentData.xlsx", "StudentData.xlsx", "StudentEvaluation.xlsx")
)

file_df %>% kable(caption = "Pre-Processed Data Files") %>% 
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")


sdata_train_pp = read_csv("sdata_train_pp.csv", col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>% as_tibble()

skim(sdata_train_pp) %>% yank("numeric") %>% head(n=5)

skim(sdata_train_pp) %>% yank("character") 

sdata_test_pp = read_csv("sdata_test_pp.csv", col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>% as_tibble()

skim(sdata_test_pp) %>% yank("numeric") %>% head(n=5)

skim(sdata_test_pp) %>% yank("character") 


edata_test_pp = read_csv("edata_test_pp.csv", col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>% as_tibble()

skim(edata_test_pp) %>% yank("numeric") %>% head(n=5)

skim(edata_test_pp) %>% yank("character") 



toc()
## 48.625 sec elapsed